ENH: ssh: Support specifying identity file via environment variable #3149

kyleam · 2019-02-07T21:19:16Z

This enables callers to use sshrun with non-standard identity files.
For example, in the ReproMan project, we want to use DataLad commands
with an AWS EC2 instance. In this case, the host will not be in
.ssh/config, and the key files are usually not in the default
location.

kyleam · 2019-02-07T21:50:18Z

Doh, my attempt at testing is probably failing due to the connection caching. Will have to revisit.

codecov · 2019-02-08T22:41:42Z

Codecov Report

Merging #3149 into master will decrease coverage by 44.79%.
The diff coverage is 38.88%.

@@            Coverage Diff             @@
##           master    #3149      +/-   ##
==========================================
- Coverage   89.47%   44.68%   -44.8%     
==========================================
  Files         248      248              
  Lines       32703    32717      +14     
==========================================
- Hits        29262    14619   -14643     
- Misses       3441    18098   +14657

Impacted Files	Coverage Δ
datalad/support/sshrun.py	`78.78% <ø> (-18.19%)`	⬇️
datalad/support/tests/test_sshconnector.py	`28.14% <33.33%> (-70.19%)`	⬇️
datalad/support/sshconnector.py	`59.51% <66.66%> (-25.15%)`	⬇️
datalad/support/tests/utils.py	`13.04% <0%> (-86.96%)`	⬇️
datalad/support/tests/test_stats.py	`13.11% <0%> (-86.89%)`	⬇️
datalad/support/tests/test_repodates.py	`13.46% <0%> (-86.54%)`	⬇️
datalad/cmdline/tests/test_formatters.py	`14.28% <0%> (-85.72%)`	⬇️
datalad/tests/test_protocols.py	`14.81% <0%> (-85.19%)`	⬇️
datalad/tests/test_dochelpers.py	`15.49% <0%> (-84.51%)`	⬇️
datalad/tests/test_config.py	`14.88% <0%> (-83.93%)`	⬇️
... and 201 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ca3225e...3a48742. Read the comment docs.

codecov · 2019-02-08T22:41:42Z

Codecov Report

Merging #3149 into master will decrease coverage by 4.14%.
The diff coverage is 96.77%.

@@            Coverage Diff             @@
##           master    #3149      +/-   ##
==========================================
- Coverage   90.71%   86.56%   -4.15%     
==========================================
  Files         249      249              
  Lines       32723    32743      +20     
==========================================
- Hits        29684    28345    -1339     
- Misses       3039     4398    +1359

Impacted Files	Coverage Δ
datalad/interface/common_cfg.py	`100% <ø> (ø)`	⬆️
datalad/support/sshrun.py	`96.96% <ø> (ø)`	⬆️
datalad/support/sshconnector.py	`85.09% <100%> (+0.44%)`	⬆️
datalad/support/tests/test_sshconnector.py	`99.27% <95.45%> (+0.94%)`	⬆️
datalad/tests/utils_testdatasets.py	`11.76% <0%> (-88.24%)`	⬇️
datalad/interface/tests/test_annotate_paths.py	`24% <0%> (-76%)`	⬇️
datalad/plugin/tests/test_addurls.py	`59.92% <0%> (-40.08%)`	⬇️
datalad/distribution/tests/test_get.py	`60.72% <0%> (-39.28%)`	⬇️
datalad/metadata/tests/test_base.py	`61.37% <0%> (-37.94%)`	⬇️
...atalad/interface/tests/test_add_archive_content.py	`68.3% <0%> (-31.7%)`	⬇️
... and 59 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 81bfe17...93c8651. Read the comment docs.

kyleam · 2019-02-11T20:13:06Z

Hmph, with the latest iteration (3a48742), I'm stumped on why test_ssh_manager_close fails with the addition of test_ssh_custom_identity_file. Seems like it'd have to be some sort of order/state dependence, but I haven't had luck figuring out what's going on.

yarikoptic · 2019-02-12T00:54:39Z

Could it be due to all already existing control paths we established in other tests?

kyleam · 2019-02-12T18:27:54Z

Could it be due to all already existing control paths we established in other tests?

Right, that was along the lines of what I was thinking when I wrote "some sort of order/state dependence". The problem is that, even if the new test were using the same socket (it's not) and if it weren't closing it (it is), I don't see how test_ssh_manager_close would fail the way that it does---saying that the path for the datalad-test socket doesn't exist.

There's Travis-specific setup for these tests, so I've been restricting my local debugging to test_ssh_custom_identity_file (with an adjustment to its ifile variable). I'm next going to try to get all the tests in test_sshconnector.py running locally to see if I can trigger the issue and step through the failing test_ssh_manager_close.

kyleam · 2019-02-12T20:20:48Z

OK, so it looks like the issue is that, after patching os.environs DATALAD_SSH_IDENTITYFILE, the value leaks into the other tests because get_connection was only doing a cfg.reload(), not cfg.reload(force=True).

So I guess the main question is who should be responsible for calling reload(force=True), SSHManager or its creators? If the answer is SSHManager, then there is the follow-up question of where SSHManager should do it, get_connection or just at initialization (__init__ or assure_initialized)?

I'm leaning towards making SSHManager do it in assure_initialized because that function already imports ConfigManager (not datalad.cfg), so it essentially does a reload(force=True) and we can piggyback on it without an added expense. I'll do that unless someone argues otherwise.

This test was written at a time when the socket path basename was simply the plain hostname. The basename has been a hash of various connection details since 5be560a (BF+ENH: Hash-based unique ControlPaths for SSH (fixes dataladgh-1243), 2017-01-31).

Upcoming commits will make it possible to configure an identity file to pass to ssh's -i. Consider the identity file when creating the connection hash because it's a defining feature of the connection.

The next commit will make this accessible to outside callers by adjusting SSHManager and exposing a configuration variable.

This enables callers to use sshrun with non-standard identity files. For example, in the ReproMan project, we want to use DataLad commands with an AWS EC2 instance. In this case, the host will not be in .ssh/config, and the key files are usually not in the default location. A plain environment variable would do for the above usecase, but add it as a variable in common_cfg.py for the visibility and documentation.

kyleam · 2019-02-12T20:51:56Z

I'm leaning towards making SSHManager do it in assure_initialized because [...]

Done with latest push.

range-diff

4:  c89249d29 = 1:  9cf3e0f2d TST+BF: sshconnector: Fix stale socket path construction
1:  f0377e599 = 2:  e9009eced ENH: ssh: Add identity_file parameter to get_connection_hash()
2:  30e7e18af = 3:  dc3226f5e ENH: SSHConnection: Support custom identity files
3:  8bf6a2ce9 ! 4:  93c86517b ENH: ssh: Support configuring an identity file
    @@ -32,21 +32,44 @@
      diff --git a/datalad/support/sshconnector.py b/datalad/support/sshconnector.py
      --- a/datalad/support/sshconnector.py
      +++ b/datalad/support/sshconnector.py
    +@@
    + class SSHManager(object):
    +     """Keeps ssh connections to share. Serves singleton representation
    +     per connection.
    ++
    ++    Note: If a connection should use a custom identity file via
    ++    `datalad.ssh.identityfile`, that value should be set before this class is
    ++    initialized.
    +     """
    + 
    +     def __init__(self):
    +@@
    +         # handling of socket_dir, so we do not define them here explicitly
    +         # to an empty list to fail if logic is violated
    +         self._prev_connections = None
    ++        self._identity_file = None
    +         # and no explicit initialization in the constructor
    +         # self.assure_initialized()
    + 
    +@@
    +         cfg = ConfigManager()
    +         self._socket_dir = opj(cfg.obtain('datalad.locations.cache'),
    +                                'sockets')
    ++        self._identity_file = cfg.get('datalad.ssh.identityfile')
    +         assure_dir(self._socket_dir)
    +         try:
    +             chmod(self._socket_dir, 0o700)
     @@
                  raise ValueError("Unsupported SSH URL: '{0}', use "
                                   "ssh://host/path or host:path syntax".format(url))
      
    -+        from datalad import cfg
    -+        # Reload because Python callers may have set environment variable after
    -+        # importing datalad. `force` needs to be true so that temporary
    -+        # environmental variables don't linger.
    -+        cfg.reload(force=True)
    ++        # Ensure self._identityfile has been set, if configured.
    ++        self.assure_initialized()
     +
    -+        identity_file = cfg.get('datalad.ssh.identityfile')
              conhash = get_connection_hash(
                  sshri.hostname,
                  port=sshri.port,
    -+            identity_file=identity_file or "",
    ++            identity_file=self._identity_file or "",
                  username=sshri.username)
              # determine control master:
              ctrl_path = "%s/%s" % (self.socket_dir, conhash)
    @@ -55,7 +78,8 @@
                  return self._connections[ctrl_path]
              else:
     -            c = SSHConnection(ctrl_path, sshri)
    -+            c = SSHConnection(ctrl_path, sshri, identity_file=identity_file)
    ++            c = SSHConnection(ctrl_path, sshri,
    ++                              identity_file=self._identity_file)
                  self._connections[ctrl_path] = c
                  return c

kyleam · 2019-02-15T21:17:33Z

I'm leaning towards making SSHManager do it in assure_initialized

Eh after hooking up this latest approach to ReproMan, I'm not sure I made the right decision (I should have tested earlier). I was hoping that using assure_initialized would mean that the config reading would be delayed until datalad.ssh_managers first get_connection call (the only other spot where SSHManager calls assure_initialized is _socket_dir). But GitRepo.__init__() also calls assure_initialized. This means that if any external python callers want to set the environment variable to override this, they need to do it before an GitRepo initialization/use (e.g., using Dataset.id), not just before anything that might call get_connection.

So the options I see are

Call config.reload(force=True) for each get_connection() call. That's the expense that the above approach was trying to avoid, but I'm not sure how bad it is in practice.
Go back to the first iteration of this PR and just check os.environ, dropping the common_cfg entry.
Do nothing. It doesn't seem that bad to say the python callers just need to set up the environment early, but depending on how the program is structured and the task, it might be difficult or impossible for it to set the environment variables before it initializes a GitRepo. And I don't believe there is currently an easy way for the caller to explicitly ask ssh_manager to refresh its info.

@yarikoptic, do you have a preference or see another, better option?

kyleam · 2019-02-15T21:34:18Z

Another option (currently my preference):

Adjust get_connection to look up the identity file with datalad.config.get (no .reload before). Document that if python callers have modified the config/environment after loading datalad and want to ensure that things are up to date, they should call datalad.config.reload(force=True).

yarikoptic · 2019-02-15T22:04:38Z

... if any external python callers want to set the environment variable...

"external" -- aren't you are using datalad Python API in reproman? shouldn't it then just change config directly (from datalad import cfg)? Then assure_initialized could memoize on either it was initialized with the current setting of the datalad.ssh.identityfile, and if value changes - reinitialize. Then you could even jump between different identities within the same process. (that is the "solution" I had in mind when I was leaving)

kyleam · 2019-02-15T22:35:28Z

"external" -- aren't you are using datalad Python API in reproman?

Yes.

shouldn't it then just change config directly (from datalad import cfg)?

Regardless of what my specific code does, I was under the assumption that we wanted to support Python callers setting environment variables. That's the reason for the whole reload(force=True) discussion in the first place, but that was probably a silly assumption on my part. I guess if that were the case, we'd need cfg.reload(force=True) sprinkled all over the codebase.

So I don't mind going in the direction you suggest, though I'd prefer the last option I suggested, which would accomplish the same thing without the need for SSHManager to have an _identity_file attribute.

yarikoptic · 2019-02-15T22:43:05Z

whatever you like the best. I do not think though it is really bad to store _identity_file attribute to provide more of automated handling, but may be I do not see possible side-effects (e.g. if we initiate some connection with out the identity_file, then need to initiate another with the identityfile -- would initialization kill existing sockets/connections? didn't check the code)

kyleam · 2019-02-15T23:03:12Z

I do not think though it is really bad to store _identity_file attribute to provide more of automated handling

I'm failing to see what the extra automation is. Either way you have a config.get call. By doing it in get_connection you do it closer to the only place the value is currently used, don't need the extra code for updating the attribute, and avoid changing the way the current assure_initialized works.

yarikoptic · 2019-02-16T01:36:45Z

Regarding your memory about needing it in environment variable... We might indeed need that (possibly to set it if not yet set while "treating config variable") because if we have a datalad special remote initiated by some annex call, it might be needed. But I don't think we have a specific use case for that yet I think.

This reverts and replaces the sshconnector.py changes from 93c8651 (ENH: ssh: Support configuring an identity file, 2019-02-07). 93c8651 updated assure_initialized() to get datalad.ssh.identityfile from its local ConfigManager instance and store that value as an attribute for later use. The motivation for retrieving the value in assure_initialized() was to avoid the expense of calling datalad.cfg.reload(force=True) with each get_connection() call. In turn, the motivation for using reload(force=True) was to prevent test_ssh_custom_identity_file's patched DATALAD_SSH_IDENTITYFILE from leaking into other tests [*]. The problem with this approach is that Python callers must set the identity file before the first assure_initialized() call---which could happen in SSHManager.get_connection() or _much earlier_ because GitRepo.__init__() also calls assure_initialized(). After the first assure_initialized() call, callers don't have a straightforward way of changing the setting. Instead let's use a plain datalad.cfg.get() call, without reloading, within get_connection(). This gives Python callers the ability to update the setting before _any_ get_connection() call, but they are now responsible for reloading datalad.cfg if needed. [*]: datalad#3149 (comment) Re: datalad#3149

## 0.11.3 (Feb 19, 2019) -- read-me-gently Just a few of important fixes and minor enhancements. ### Fixes - The logic for setting the maximum command line length now works around Python 3.4 returning an unreasonably high value for `SC_ARG_MAX` on Debian systems. ([#3165]) - DataLad commands that are conceptually "read-only", such as `datalad ls -L`, can fail when the caller lacks write permissions because git-annex tries merging remote git-annex branches to update information about availability. DataLad now disables `annex.merge-annex-branches` in some common "read-only" scenarios to avoid these failures. ([#3164]) ### Enhancements and new features - Accessing an "unbound" dataset method now automatically imports the necessary module rather than requiring an explicit import from the Python caller. For example, calling `Dataset.add` no longer needs to be preceded by `from datalad.distribution.add import Add` or an import of `datalad.api`. ([#3156]) - Configuring the new variable `datalad.ssh.identityfile` instructs DataLad to pass a value to the `-i` option of `ssh`. ([#3149]) ([#3168]) * tag '0.11.3': (35 commits) ENH: Finalized changelog entry for 0.11.3 [DATALAD RUNCMD] CHANGELOG: Re-linkify 0.11.3 entries CHANGELOG: Mention #3168 ENH: sshconnector: Get identity file with datalad.cfg.get() [DATALAD RUNCMD] CHANGELOG: Linkify 0.11.3 entries CHANGELOG: Add entries for 0.11.3 BF/RF(TST): skip test if actual sudo chown call fails BF(TMP): declare check_datasets_datalad_org failing on windows BF(TST): replace not relevant trailing .pull test with .repo_info BF(BK): for some reason an exception on repo_info invocation isn't raised while on travis TST: test_ro_operations via sudo (when possible) BF: allow to disallow git-annex branch merges for repo_info and use that in ls BF(unicode): use assure_unicode while formatting an exception BF(PY3): workaround for python3.4 on debian returning obnoxious SC_ARG_MAX BF: revert change for repo_info about not merging remote git-annex BF: set annex.merge-annex-branches=false for invocations not requiring updated remote information ENH: ssh: Support configuring an identity file ENH: SSHConnection: Support custom identity files ENH: ssh: Add identity_file parameter to get_connection_hash() TST+BF: sshconnector: Fix stale socket path construction ...

## 0.11.3 (Feb 19, 2019) -- read-me-gently Just a few of important fixes and minor enhancements. ### Fixes - The logic for setting the maximum command line length now works around Python 3.4 returning an unreasonably high value for `SC_ARG_MAX` on Debian systems. ([#3165]) - DataLad commands that are conceptually "read-only", such as `datalad ls -L`, can fail when the caller lacks write permissions because git-annex tries merging remote git-annex branches to update information about availability. DataLad now disables `annex.merge-annex-branches` in some common "read-only" scenarios to avoid these failures. ([#3164]) ### Enhancements and new features - Accessing an "unbound" dataset method now automatically imports the necessary module rather than requiring an explicit import from the Python caller. For example, calling `Dataset.add` no longer needs to be preceded by `from datalad.distribution.add import Add` or an import of `datalad.api`. ([#3156]) - Configuring the new variable `datalad.ssh.identityfile` instructs DataLad to pass a value to the `-i` option of `ssh`. ([#3149]) ([#3168]) * tag '0.11.3': BF(TST): skip if windows OR root; catch any exception

## 0.11.3 (Feb 19, 2019) -- read-me-gently Just a few of important fixes and minor enhancements. ### Fixes - The logic for setting the maximum command line length now works around Python 3.4 returning an unreasonably high value for `SC_ARG_MAX` on Debian systems. ([#3165]) - DataLad commands that are conceptually "read-only", such as `datalad ls -L`, can fail when the caller lacks write permissions because git-annex tries merging remote git-annex branches to update information about availability. DataLad now disables `annex.merge-annex-branches` in some common "read-only" scenarios to avoid these failures. ([#3164]) ### Enhancements and new features - Accessing an "unbound" dataset method now automatically imports the necessary module rather than requiring an explicit import from the Python caller. For example, calling `Dataset.add` no longer needs to be preceded by `from datalad.distribution.add import Add` or an import of `datalad.api`. ([#3156]) - Configuring the new variable `datalad.ssh.identityfile` instructs DataLad to pass a value to the `-i` option of `ssh`. ([#3149]) ([#3168]) * tag '0.11.3': Boost version to 0.11.3

kyleam added the WIP work in progress label Feb 7, 2019

kyleam force-pushed the ssh-identify-env branch from 59e2a39 to 3a48742 Compare February 8, 2019 22:41

yarikoptic added this to the Release 0.11.3 milestone Feb 10, 2019

kyleam force-pushed the ssh-identify-env branch 2 times, most recently from 511acbc to 3a48742 Compare February 11, 2019 20:01

kyleam force-pushed the ssh-identify-env branch from 88e7281 to c89249d Compare February 12, 2019 19:59

kyleam added 4 commits February 12, 2019 15:44

ENH: ssh: Add identity_file parameter to get_connection_hash()

e9009ec

Upcoming commits will make it possible to configure an identity file to pass to ssh's -i. Consider the identity file when creating the connection hash because it's a defining feature of the connection.

ENH: SSHConnection: Support custom identity files

dc3226f

The next commit will make this accessible to outside callers by adjusting SSHManager and exposing a configuration variable.

kyleam force-pushed the ssh-identify-env branch from c89249d to 93c8651 Compare February 12, 2019 20:51

kyleam removed the WIP work in progress label Feb 12, 2019

yarikoptic merged commit cf3f899 into datalad:master Feb 15, 2019

kyleam deleted the ssh-identify-env branch February 15, 2019 19:22

kyleam mentioned this pull request Feb 15, 2019

Prepare 0.11.3 release #3167

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: ssh: Support specifying identity file via environment variable #3149

ENH: ssh: Support specifying identity file via environment variable #3149

kyleam commented Feb 7, 2019

kyleam commented Feb 7, 2019

codecov bot commented Feb 8, 2019

codecov bot commented Feb 8, 2019 •

edited

kyleam commented Feb 11, 2019

yarikoptic commented Feb 12, 2019

kyleam commented Feb 12, 2019

kyleam commented Feb 12, 2019

kyleam commented Feb 12, 2019

kyleam commented Feb 15, 2019 •

edited

kyleam commented Feb 15, 2019

yarikoptic commented Feb 15, 2019 •

edited

kyleam commented Feb 15, 2019

yarikoptic commented Feb 15, 2019

kyleam commented Feb 15, 2019

yarikoptic commented Feb 16, 2019 •

edited

ENH: ssh: Support specifying identity file via environment variable #3149

ENH: ssh: Support specifying identity file via environment variable #3149

Conversation

kyleam commented Feb 7, 2019

kyleam commented Feb 7, 2019

codecov bot commented Feb 8, 2019

Codecov Report

codecov bot commented Feb 8, 2019 • edited

Codecov Report

kyleam commented Feb 11, 2019

yarikoptic commented Feb 12, 2019

kyleam commented Feb 12, 2019

kyleam commented Feb 12, 2019

kyleam commented Feb 12, 2019

kyleam commented Feb 15, 2019 • edited

kyleam commented Feb 15, 2019

yarikoptic commented Feb 15, 2019 • edited

kyleam commented Feb 15, 2019

yarikoptic commented Feb 15, 2019

kyleam commented Feb 15, 2019

yarikoptic commented Feb 16, 2019 • edited

codecov bot commented Feb 8, 2019 •

edited

kyleam commented Feb 15, 2019 •

edited

yarikoptic commented Feb 15, 2019 •

edited

yarikoptic commented Feb 16, 2019 •

edited